Assignment 1



Tree Added by Party Package

knitr::include_graphics("tree_output.pdf")




Assignment 2


1. Read data from SENIC.txt into R

data <- read.delim("SENIC.txt", header = FALSE, sep = "")
names(data) <- c("Identification Number", "Length of Stay", "Age", "Infection Risk", "Routine Culturing Ratio", "Routine Chest X-ray Ratio", "Number of Beds", "Medical School Affiliation", "Region", "Average Daily Census", "Number of Nurses", "Available Facilities & Services")
Table 1.1 First 10 rows of the dataset
Identification Number Length of Stay Age Infection Risk Routine Culturing Ratio Routine Chest X-ray Ratio Number of Beds Medical School Affiliation Region Average Daily Census Number of Nurses Available Facilities & Services
1 7.13 55.7 4.1 9.0 39.6 279 2 4 207 241 60
2 8.82 58.2 1.6 3.8 51.7 80 2 2 51 52 40
3 8.34 56.9 2.7 8.1 74.0 107 2 3 82 54 20
4 8.95 53.7 5.6 18.9 122.8 147 2 4 53 148 40
5 11.20 56.5 5.7 34.5 88.9 180 2 1 134 151 40
6 9.76 50.9 5.1 21.9 97.0 150 2 2 147 106 40
7 9.68 57.8 4.6 16.7 79.0 186 2 3 151 129 40
8 11.18 45.7 5.4 60.5 85.8 640 1 2 399 360 60
9 8.67 48.2 4.3 24.4 90.8 182 2 3 130 118 40
10 8.84 56.3 6.3 29.6 82.6 85 2 1 59 66 40



2. Outliers Function

outlier_indices <- function(column) {
  q1 <- quantile(column)[2]
  q3 <- quantile(column)[4]
  return(column > q3+1.5*(q3-q1) | column < q1-1.5*(q3-q1))
}



3. Density plot of Infection risk with outliers


Comment:


Above figure shows 6 outliers on infection risk, three of them are grater than than q3+1.5(q3-q1), while the others are below q1-1.5(q3-q1). Infection risk appears to be approximately normally distributed with a mean around 4.5.



4. Density plots with outliers

Comment:

The variables Age, Infection Risk, Routine Chest X-ray Ratio and Available Facilities & Services are approximately normally distributed. Age and Infection Risk both have outliers that are greater than Q3+1.5(Q3-Q1) and smaller than Q1-1.5(Q3-Q1). Routine Chest X-ray Ratio only have one outlier that is greater than Q3+1.5(Q3-Q1), and the variable Available Facilities & Services have no outliers.

The distributions of the variables Length of Stay, Routine Culturing Ratio, Number of Beds, Average Daily Census and Number of Nurses are all right-skewed. Observations whose values are greater than Q3+1.5(Q3-Q1) exists in all of these variables.

5. Scatter plot showing the dependence of Infection risk on the Number of Nurses where points are colored by Number of Beds

Comment:


Above figure shows a positive relationship between Infection Risk and Number of Nurses, due to color scale saturation it’s hard to examine the graph.

A potential danger of using such color scale for the Number of Beds is that the outliers can have a big impact on the scale, which can make it hard to distinguish variation in Number of Beds among the other observations. In the figure above it can be hard to distinguish an observation with less than 100 beds from an observation with 200 beds.



6. ggplotly function

Comment:


Changing to ggplotly made it easy to navigate the density graph and shows what are the outliers specifically.



7. Histogram of Infection risk



Shiny App

data <- read.delim("SENIC.txt", header = FALSE, sep = "")
data <- data[,c(2:7, 10:12)]
names(data) <- c("Length of Stay", "Age", "Infection Risk", "Routine Culturing Ratio", "Routine Chest X-ray Ratio", "Number of Beds", "Average Daily Census", "Number of Nurses", "Available Facilities & Services")

outlier_indices <- function(column) {
    q1 <- quantile(column)[2]
    q3 <- quantile(column)[4]
    return(column > q3+1.5*(q3-q1) | column < q1-1.5*(q3-q1))
}

ui <- fluidPage(
    checkboxGroupInput("variable", "Select variables:",
                       c("Length of Stay", "Age", "Infection Risk", "Routine Culturing Ratio", "Routine Chest X-ray Ratio", "Number of Beds", "Average Daily Census", "Number of Nurses", "Available Facilities & Services")),
    sliderInput("bandwidth", "Bandwidth:", min = 1, max = 1000, value = 1),
    mainPanel(plotOutput("density_plots"))
)

server <- function(input, output, session) {
    output$density_plots <- renderPlot({
        if(length(input$variable) > 0) {
            plots <- lapply(input$variable, function(col) {
                p <- ggplot(data, aes(x = data[,col])) +
                    geom_density(bw = input$bandwidth) +
                    labs(x = col)
                if(nrow(data[outlier_indices(data[,col]),]) > 0) {
                    p <- p + geom_point(data = data[outlier_indices(data[,col]),], aes(x = data[outlier_indices(data[,col]),col], y = 0), shape = "diamond", size = 4)
                }
                p
            })
            plot(arrangeGrob(grobs = plots))
            }
        })
}

# Run the application 
shinyApp(ui = ui, server = server)

Comment:


A higher bandwidth value makes the density plot more smoothed, while a lower bandwidth value makes the density plot closer to the actual, observed data.

The optimal bandwidth depend on the graph’s scale, but we estimated that the most suitable bandwidths for all graph are between 0.3 and 0.4 times the interquartile range of the variable.

Appendix

library(ggplot2)
library(gridExtra)
library(plotly)

################
#### Step 1 ####
################

data <- read.delim("SENIC.txt", header = FALSE, sep = "")
names(data) <- c("Identification Number", "Length of Stay", "Age", "Infection Risk", "Routine Culturing Ratio", "Routine Chest X-ray Ratio", "Number of Beds", "Medical School Affiliation", "Region", "Average Daily Census", "Number of Nurses", "Available Facilities & Services")

################
#### Step 2 ####
################

outlier_indices <- function(column) {
  q1 <- quantile(column)[2]
  q3 <- quantile(column)[4]
  return(column > q3+1.5*(q3-q1) | column < q1-1.5*(q3-q1))
}

################
#### Step 3 ####
################

p_density <- ggplot(data, aes(x = data[,4])) +
  geom_density() +
  labs(x = names(data)[4]) +
  geom_point(data = data[outlier_indices(data[,4]),], aes(x = data[outlier_indices(data[,4]),4], y = 0), shape = "diamond", size = 4)
p_density

################
#### Step 4 ####
################

plots <- lapply(c(2:7, 10:12), function(col) {
  p <- ggplot(data, aes(x = data[,col])) +
    geom_density() +
    labs(x = names(data)[col])
  if(nrow(data[outlier_indices(data[,col]),]) > 0) {
    p <- p + geom_point(data = data[outlier_indices(data[,col]),], aes(x = data[outlier_indices(data[,col]),col], y = 0), shape = "diamond", size = 4)
  }
  p
  })

plot(arrangeGrob(grobs = plots))

################
#### Step 5 ####
################

p_scatter <- ggplot(data, aes(x = `Infection Risk`, y = `Number of Nurses`)) +
  geom_point(aes(col = `Number of Beds`))
p_scatter

################
#### Step 6 ####
################

ggplotly(p_density)

################
#### Step 7 ####
################

fig <- data %>%
  plot_ly(x = ~`Infection Risk`, type = "histogram") %>%
  add_markers(x = ~`Infection Risk`[outlier_indices(data[,4])], y = 0, marker = list(symbol = "diamond", size = 10)) %>%
  layout(xaxis = list(title = "Infection Risk"), yaxis = list(title = "Frequency"), showlegend = FALSE)
fig

Statement of Contribution


Simon and Mohamed devised the whole assignment together, the main conceptual ideas and codes outline. Mohamed worked out Assignment 1, tree visualization, and the report creation using r markdown, Simon worked out Assignment 2 and carried out all codes and functions, including the r shiny App.